A Low-Resourced Peruvian Language Identification Model
Abstract
Owing to the linguistic revitalization in Perú in recent years, there is growing interest in reinforcing bilingual education in the country and in increasing research focused on its native languages. From a computer science perspective, one of the first steps to support the study of these languages is the implementation of an automatic language identification tool using machine learning methods. Therefore, this work focuses on two steps: (1) building a digital, annotated corpus for 16 Peruvian native languages, extracted from documents in web repositories, and (2) fitting a supervised learning model for the language identification task using features identified in related state-of-the-art studies, such as n-grams. The results obtained were promising (97% average precision), and the corpus and the model are expected to be useful for more complex tasks in the future.
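The second step, a supervised classifier over n-gram features, can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not name the classifier or the exact n-gram settings, so the multinomial naive Bayes classifier, the character n-gram range of 1 to 3, and the names `char_ngrams` and `NgramNaiveBayes` are all assumptions made for illustration. The toy English and Spanish sentences are placeholders, not samples from the 16-language corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return character n-grams of length n_min..n_max from padded, lowercased text."""
    padded = f" {text.lower().strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-gram counts (add-one smoothing).

    Hypothetical stand-in for the supervised model; the paper's actual
    classifier is not specified in the abstract.
    """

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, grams in self.counts.items():
            # Log prior plus smoothed log likelihood of every n-gram in the text.
            score = math.log(self.doc_counts[label] / n_docs)
            for gram in char_ngrams(text):
                score += math.log((grams[gram] + 1) / (self.totals[label] + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder training data (assumption): two languages, two sentences each.
train_texts = [
    "the house is big and the dog is small",
    "where is the book that you gave me",
    "la casa es grande y el perro es pequeño",
    "dónde está el libro que me diste",
]
train_labels = ["en", "en", "es", "es"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the dog is in the house"))
print(model.predict("el perro está en la casa"))
```

Character n-grams are a common choice for this task because they need no tokenizer or dictionary, which matters when the target languages have few digital resources. A real experiment would train on the annotated corpus and report per-language precision on held-out text.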
Related resources
EFL Teachers’ Corrective Feedback and Students’ Revision in a Peruvian University: A descriptive study
This study explored the EFL teachers’ written corrective feedback (CF) techniques and their EFL students’ ability to integrate the CF while revising their texts. A total of 72 EFL students and 4 EFL teachers participated in this study. The data were collected through explicitation interviews administered to teachers and students, as well as through students’ written productions. A content analy...
Modeling code-Switching speech on under-resourced languages for language identification
This paper presents an integration of phonotactic information to perform language identification (LID) in mixed-language speech. A single-pass front-end recognition system is employed to convert the spoken utterances into statistical occurrences of phone sequences. To process such phone sequences, a hidden Markov model (HMM) is utilized to build robust acoustic models that can handle multipl...
Language identification of code Switching sentences and multilingual sentences of under-resourced languages by using multi structural word information
Language identification (LID) is the process of identifying the languages used in a text or speech. Code switching is the switching of language within a sentence or speech utterance. This paper focuses on LID of words in code-switching sentences. Code switching can be intersentential or intrasentential. A writer switches from one language to another for various reasons and among ...
Language Identification for Under-Resourced Languages in the Basque Context
Automatic Speech Recognition (ASR) is a broad research area that absorbs considerable effort from the research community. Interest in multilingual systems arises in the Basque Country because there are three official languages (Basque, Spanish, and French), and there is much linguistic interaction among them, even though Basque has very different roots from the other two languages. The development of...